Welcome back!

Updates

  • Exam for exchange students: 21.12.2023 at 16:15 in room 01-013.
  • The mock exam is online 💣

Part II: Data gathering and preparation

Date Topic
16.11.2023 Data preparation and manipulation
23.11.2023 Basic statistics and data analysis with R
23.11.2023 Exercises/Workshop 4: Data gathering, data import
30.11.2023 Guest Lecture: Matteo Courthoud (Senior Economist and Data Scientist @Zalando)

Part III: Analysis, visualisation, output

Date Topic
07.12.2023 Visualisation, dynamic documents
07.12.2023 Exercises/Workshop 5: Data preparation and applied data analysis with R
14.12.2023 Guest Lecture: Florian Chatagny (Head of Data Science @Federal Finance Administration in Bern)
21.12.2023 Exercises/Workshop 6: Visualization, dynamic documents
21.12.2023 Summary, Wrap-Up, Q&A, Feedback
21.12.2023 Exam for Exchange Students

Work with Data

Warm up

JSON files: open-ended question

Consider the following JSON file:

{
  "students": [
    {
      "id": 19091,
      "firstName": "Peter",
      "lastName": "Mueller",
      "grades": {
          "micro": 5,
          "macro": 4.5,
          "data handling": 5.5
          }
    },
    {
      "id": 19092,
      "firstName": "Anna",
      "lastName": "Schmid",
      "grades": {
          "micro": 5.25,
          "macro": 4,
          "data handling": 5.75
          }
    },
    {
      "id": 19093,
      "firstName": "Noah",
      "lastName": "Trevor",
      "grades": {
          "micro": 4,
          "macro": 4.5,
          "data handling": 5
          }
    }
  ]
}

Write R code that extracts a table whose first column is a vector of first names and whose second column is the average grade per student. The table can be a data frame or a tibble.
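
One possible solution sketch, assuming the jsonlite package is installed. The JSON from above is pasted into a string here; the names json_text, students, and result are chosen for illustration only.

```r
library(jsonlite)

# The JSON document from the slide, stored as a string for this sketch
json_text <- '{
  "students": [
    {"id": 19091, "firstName": "Peter", "lastName": "Mueller",
     "grades": {"micro": 5, "macro": 4.5, "data handling": 5.5}},
    {"id": 19092, "firstName": "Anna", "lastName": "Schmid",
     "grades": {"micro": 5.25, "macro": 4, "data handling": 5.75}},
    {"id": 19093, "firstName": "Noah", "lastName": "Trevor",
     "grades": {"micro": 4, "macro": 4.5, "data handling": 5}}
  ]
}'

# fromJSON() simplifies the array of student objects to a data frame;
# the nested "grades" objects become a data-frame column with one
# column per course
students <- fromJSON(json_text)$students

# First column: first names; second column: row-wise mean of all grades
result <- data.frame(
  firstName = students$firstName,
  averageGrade = unname(rowMeans(students$grades))
)

print(result)
```

If a tibble is preferred, tibble::as_tibble(result) should convert the data frame without changing its content.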

XML:

<students>
  <student>
    <id>19091</id>
    <firstName>Peter</firstName>
    <lastName>Mueller</lastName>
    <grades>
      <micro>5</micro>
      <macro>4.5</macro>
      <dataHandling>5.5</dataHandling>
    </grades>
  </student>
  <student>
    <id>19092</id>
    <firstName>Anna</firstName>
    <lastName>Schmid</lastName>
    <grades>
      <micro>5.25</micro>
      <macro>4</macro>
      <dataHandling>5.75</dataHandling>
    </grades>
  </student>
  <student>
    <id>19093</id>
    <firstName>Noah</firstName>
    <lastName>Trevor</lastName>
    <grades>
      <micro>4</micro>
      <macro>4.5</macro>
      <dataHandling>5</dataHandling>
    </grades>
  </student>
</students>
  • ‘students’ is the root-node, ‘grades’ are its children
  • the siblings of Noah Trevor are Anna Schmid and Peter Mueller
  • The code below would be an alternative, equivalent notation for the third student in the XML file above.
<student id="19093" firstName="Noah" lastName="Trevor">
  <grades micro="4" macro="4.5" dataHandling="5" />
</student>
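
Statements about the tree structure like those above can be checked directly in R. The following is a minimal sketch, assuming the xml2 package is installed; the XML from above is pasted into a string, and all variable names are illustrative.

```r
library(xml2)

# The XML document from the slide, stored as a string for this sketch
xml_input <- '<students>
  <student><id>19091</id><firstName>Peter</firstName><lastName>Mueller</lastName>
    <grades><micro>5</micro><macro>4.5</macro><dataHandling>5.5</dataHandling></grades>
  </student>
  <student><id>19092</id><firstName>Anna</firstName><lastName>Schmid</lastName>
    <grades><micro>5.25</micro><macro>4</macro><dataHandling>5.75</dataHandling></grades>
  </student>
  <student><id>19093</id><firstName>Noah</firstName><lastName>Trevor</lastName>
    <grades><micro>4</micro><macro>4.5</macro><dataHandling>5</dataHandling></grades>
  </student>
</students>'

doc <- read_xml(xml_input)

# Name of the root node
root_name <- xml_name(doc)

# Names of the root node's direct children
child_names <- xml_name(xml_children(doc))

# Siblings of the third student node, identified by their first names
third_student <- xml_children(doc)[[3]]
sibling_names <- xml_text(xml_find_first(xml_siblings(third_student), "firstName"))

print(root_name)
print(child_names)
print(sibling_names)
```

Printing these three objects shows which node is the root, which nodes are its children, and which nodes count as siblings of a given student.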

Data Gathering Procedure

A Template/Blueprint

Tell your future self what this script is all about

#######################################################################
# Project XY: Data Gathering and Import
#
# This script is the first part of the data pipeline of project XY.
# It imports data from ...
# Input: links to data sources (data comes in ... format)
# Output: cleaned data as CSV
# 
# U. Matter, St. Gallen, 2018
#######################################################################


# SET UP --------------
# load packages
library(tidyverse)

# set fixed variables
INPUT_PATH <- "/rawdata"
OUTPUT_FILE <- "/final_data/datafile.csv"


# IMPORT RAW DATA FROM CSVs -------------

Goals for today

Goals for today: cognitive goals

  • Recognize where the problems are in a given dataset, and what is in the way of a proper analysis of the data.
  • Organize your work: what needs to be addressed first?

Goals for today: skills

  • Use simple string-operations to clean text variables.
  • Reshape datasets from wide to long (and vice versa).
  • Apply row-binding/stacking of datasets.

The dataset is imported, now what?

  • In practice: still a long way to go.
  • Parsable, but messy data: Inconsistencies, data types, missing observations, wide format.
  • Goal of data preparation: Dataset is ready for analysis.
  • Key conditions:
    1. Data values are consistent/clean within each variable.
    2. Variables are of proper data types.
    3. Dataset is ‘tidy’ (in long format; more on this after the break).

“Garbage in, garbage out”…

Move to Nuvolos

Tidy data: some vocabulary

Following Wickham (2014):

  • Dataset: Collection of values (numbers and strings).
  • Every value belongs to a variable and an observation.
  • Variable: Contains all values that measure the same underlying attribute across units.
  • Observation: Contains all values measured on the same unit (e.g., a person).

Tidy data

Reshaping: the concept
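
A minimal sketch of reshaping from wide to long and back, assuming the tidyr package (part of the tidyverse) is installed. It reuses two students and two grades from the warm-up data; the column names course and grade are chosen for illustration.

```r
library(tidyr)

# Wide format: one row per student, one column per course
wide <- data.frame(
  firstName = c("Peter", "Anna"),
  micro = c(5, 5.25),
  macro = c(4.5, 4)
)

# Wide to long: one row per student-course combination
long <- pivot_longer(
  wide,
  cols = c(micro, macro),
  names_to = "course",
  values_to = "grade"
)

# Long to wide: spread the courses back into separate columns
wide_again <- pivot_wider(long, names_from = "course", values_from = "grade")

print(long)
print(wide_again)
```

In the long version every row is one observation (a grade of one student in one course), which is the layout most tidyverse functions expect.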

Stack/row-bind: the concept
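
A minimal sketch of stacking two datasets, assuming the dplyr package (part of the tidyverse) is installed. The two small data frames stand in for separately imported files; their names (file_a, file_b) and the column source are hypothetical.

```r
library(dplyr)

# Two hypothetical datasets with overlapping, but not identical, columns
file_a <- data.frame(firstName = c("Peter", "Anna"), micro = c(5, 5.25))
file_b <- data.frame(firstName = "Noah", micro = 4, macro = 4.5)

# bind_rows() matches columns by name and fills columns that are
# missing in one of the inputs with NA; .id records each row's origin
stacked <- bind_rows(list(file_a = file_a, file_b = file_b), .id = "source")

print(stacked)
```

Base R's rbind() would stop with an error here because the two data frames do not have the same columns, which is one reason bind_rows() is convenient for stacking files that were collected at different times.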

Move to Nuvolos

Q&A

References